Bilingually motivated segmentation and generation of word translations using relatively small translation data sets
Abstract
Out-of-vocabulary (OOV) bilingual lexicon entries are still a problem for many applications, including translation. We propose a method for machine learning of bilingual stem and suffix translations, which are then used to decide segmentations for new translations. Various state-of-the-art measures for segmenting words into their sub-constituents are adopted as features for an SVM-based linear classifier that decides appropriate segmentations of bilingual pairs, specifically for learning bilingual suffixation.
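As a rough illustration of the pipeline the abstract describes, the sketch below enumerates candidate stem/suffix splits for a bilingual pair and ranks them with a linear decision function. It is not the authors' implementation: the function names, the two count-based features, the toy counts, and the hand-set weights are all assumptions made for illustration. In the described method the features are established segmentation measures and the weights would be learned by an SVM.

```python
from itertools import product


def splits(word, max_suffix=4):
    """Return (stem, suffix) candidates; the suffix may be empty, the stem never is."""
    longest = min(max_suffix, len(word) - 1)
    return [(word[:len(word) - k], word[len(word) - k:])
            for k in range(0, longest + 1)]


def bilingual_candidates(src, tgt, max_suffix=4):
    """Pair every split of the source word with every split of the target word."""
    return list(product(splits(src, max_suffix), splits(tgt, max_suffix)))


def features(candidate, suffix_pair_counts, stem_pair_counts):
    """Two hypothetical features: how often the suffix pair and the stem pair
    were seen together in some bilingual lexicon (assumed, not from the paper)."""
    (s_stem, s_suf), (t_stem, t_suf) = candidate
    return [
        suffix_pair_counts.get((s_suf, t_suf), 0),
        stem_pair_counts.get((s_stem, t_stem), 0),
    ]


def linear_score(x, weights, bias):
    """Decision value of a linear classifier: w . x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


# Toy data: invented counts and hand-set weights, for illustration only.
suffix_pair_counts = {("ed", "ou"): 12}
stem_pair_counts = {("walk", "and"): 3}
weights, bias = [1.0, 1.0], -1.0

candidates = bilingual_candidates("walked", "andou")
best = max(
    candidates,
    key=lambda c: linear_score(
        features(c, suffix_pair_counts, stem_pair_counts), weights, bias
    ),
)
print(best)
```

With these toy counts, the top-ranked candidate splits the pair as walk+ed / and+ou. Capping the suffix length keeps the candidate set small, since the number of bilingual candidates is the product of the two words' split counts.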
Similar resources
Bilingually Motivated Word Segmentation for SMT
We introduce a bilingually motivated word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Our approach is motivated by the insight that PB-SMT systems can be improved by optimising the input representation to reduce the predictive power of translation models. We first present...
Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation
We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for t...
Bilingual Segmentation for Alignment and Translation
We propose a method that bilingually segments sentences in languages with no clear delimiter for word boundaries. In our model, we first convert the search for the segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic-programming solution, and incorporate a control to balance monolingual and bilingual information at hand. Our bilingual segmentation algorithm, th...
A Bayesian model of bilingual segmentation for transliteration
In this paper we propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the o...
Effects of Integrating Multiple Bilingually-Trained Segmentation Schemes for Japanese-English SMT
This paper proposes a method to integrate multiple segmentation schemes into a single statistical machine translation (SMT) system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating Japanese into English revealed that the proposed method of integrating multiple segmentation schemes outperforms ...